Pytheas: a software package for the automated analysis of RNA sequences and modifications via tandem mass spectrometry

Mass spectrometry is an important method for analysis of modified nucleosides ubiquitously present in cellular RNAs, in particular for ribosomal and transfer RNAs that play crucial roles in mRNA translation and decoding. Furthermore, modifications have effect on the lifetimes of nucleic acids in plasma and cells and are consequently incorporated into RNA therapeutics. To provide an analytical tool for sequence characterization of modified RNAs, we developed Pytheas, an open-source software package for automated analysis of tandem MS data for RNA. The main features of Pytheas are flexible handling of isotope labeling and RNA modifications, with false discovery rate statistical validation based on sequence decoys. We demonstrate bottom-up mass spectrometry characterization of diverse RNA sequences, with broad applications in the biology of stable RNAs, and quality control of RNA therapeutics and mRNA vaccines.


Additional details on Pytheas matching algorithm and the scoring function.
Pytheas Sp score is a numerical value computed for each oligonucleotide-spectrum match (OSM) and represents the confidence in the sequence assignment based on the quality of the fit between theoretical and experimental MS/MS spectra. The output of the matching and scoring tool in Pytheas contains lists of OSMs that are grouped by the precursor m/z and RT (retention time), sorted in the descending Sp order, and assigned rank (Supplementary Fig. 12). ΔSp is an additional score that is calculated to measure the relative distance between the top and other competing OSMs. Two additional parameters in this work, ΔSp2 and ΔSpD, are used to determine how well the scoring function discriminates between rank 1 OSM and the competing target and the decoy OSMs with lower ranks.
The optimized Pytheas Sp score represented by Equation 1 in Methods consists of three separate terms.
∑ match ∑ all term defines the fraction of an experimental spectrum that can be assigned using L theoretically predicted MS2 ions. term defines the fraction of the predicted fragment ions that were identified in the spectrum. n and L counting of the sequence-defining MS2 ions is executed in a charge-independent manner.
For example, matched c4(-1) and c4 (-2) are counted as a single instance of c4 fragment ion, but c4(-1) and c4(-2) peak intensities equally contribute to ∑ match and ∑ ll . This was set up to alleviate the uncertainty present in the process of MS2 ion charge prediction.
To calculate ∑ match , only intensities of the matched sequence-defining ions are used. The search for sequence-defining ions is executed after precursor ion peak (M) and precursor ion losses (e.g., M-P, M-H2O, M-B etc.) peaks are assigned and excluded from search via the associated mass exclusion windows.
All peak intensities are then normalized to the most intense peak present in the spectrum after the exclusion, and by default a relative 5% intensity threshold applied. ∑ all is then calculated by including all peak intensities in the specified m/z range (e.g.,  that remain after precursor ion and precursor ion losses exclusion and after relative intensity cutoff. This was set to increase sensitivity of the Pytheas score to the spectral features that are more likely represented by the sequence-defining ions. Furthermore, normalization to ∑ match introduces penalties for unassigned peaks in the MS/MS spectrum, which can arise due to chromatographic co-elution of RNA oligonucleotides or otherwise caused by instrument noise. This improves specificity of Sp for the correct identification. Critically, Pytheas does not recognize isotopic envelopes of the fragment ions and only monoisotopic peaks (m) contribute to ∑ match , while m and m+1, m+2 etc. isotopologue peaks contribute to ∑ all . We plan to address this issue in future releases to avoid penalties introduced to the Sp score by the lack of isotopologue peak assignments.  Supplementary Fig. 9-11 and 18). ∑Bs term effectively contributes to discriminate between anagrams (oligonucleotide sequences with same nucleoside composition and precursor ion mass, but different positional order of nucleosides).
The extended definition of B is shown below: Here, i represents the connectivity index for consecutive matches, and ki is a number of matches with connectivity index i found across the ion series s. β is a basal reward value for occurrence of a consecutive match, α is a reward coefficient used to increase the reward for the second, third etc. consecutive matches.
The standard values of α and β parameters were set to 0 and 0.075 respectively during the process of Sp training and validation. However, α and β can be easily changed by the user at the matching and scoring step, for instance to improve identification coverage for long oligonucleotides or oligos within specified size ranges. In his study, α = 2 and β = 0.025 have been used interchangeably with the default values for the analysis of several datasets presented (see Pytheas Database Search in Methods). In general, setting α > 0 increases Sp scores for sequences that are >10 nt in size, and improves sensitivity of their identifications. On the other hand, very large α values (or a combination of small α and large β) will quickly reduce Sp scores for short oligonucleotides (3-5 nt in size). We found empirically that the usage of α = 2 and β = 0.025 introduced minimal sequence length biases (Sp values spread over a narrow 0-2 range), and slightly improved ΔSpD distribution compared to the default α = 0 and β = 0.075 parameter values. As for the global identification coverage, no significant differences (5-10 % more IDs) were observed by using either of these two sets of parameters.
From peptide SEQUEST score to oligonucleotide Pytheas score.
In contrast to limited cleavage of the peptide backbone, CID of RNA leads to more complex fragmentation pattern, with 9-11 ion series consistently observed in the spectrum. We found that including all the series in the original SEQUEST score improved identification. This was particularly important for matching short (3-5 nt) and long (> 10 nt) sequences. In the former case, due to high repeatability of nucleoside units in the sequence, competition between candidate matches is very tight, and large number of identified MS2 ions becomes crucial for correct identification. In the latter case, reduced ionization efficiencies and complex isotopic envelopes makes finding all predicted fragment ions difficult (Supplementary Fig. 18).
Identification of long RNA sequences is thus enabled through larger amount of MS2 ions present, and through utilizing Bs reward for consecutive matches. The major goal of the optimization process was to alleviate the length-dependence of Sp (the "roof"-like behavior, with lower scores observed for short and long oligomers) presented by the original SEQUEST score, while retaining high specificity for correct identification.   Pytheas, in addition to 3' -OH and -P (linear phosphate), to accommodate identification of nucleolytic fragments obtained at low RNase concentrations. Cusativin endonuclease has been recombinantly expressed and purified 2 .

Support of positive ionization mode. Analysis of RNA MS/MS data acquired in positive ionization mode
is implemented in Pytheas by selecting "positive" ion mode option at the in silico digestion step. As an example, we acquired MS/MS data of an 8-mer synthetic oligonucleotide with Agilent Q-TOF. The data were matched against the theoretical library obtained using an extensive set of MS2 charge states and fragmentation series. This was done due to the limited availability of RNA data at positive ionization.
Fragment ion matches are highlighted, color-coded based on their ion series and reported in the table below.
The table shows theoretically predicted m/z values. Usually, spectra acquired via negative ionization consistently display MS2 ions from 9 fragmentation series (shown in the table header), however positive mode spectrum above is mainly represented by c/y (red) and a/a-B/w (green) fragments.  Fig. 15). Rank  Example of a tandem spectrum for a 13-mer synthetic RNA oligo. F is a one letter Pytheas notation and

Support of dissociation methods other than Collision Induced Dissociation (CID
[Am] is an extended notation for 2'-OMe adenosine. Fragment ion matches are highlighted, color-coded based on their ion series and reported in the table below the spectrum using predicted m/z values. Unlike short oligomers, sequences >10-12 nt in length typically observe decreased number of MS2 matches (n/L ~0.4-0.6), that results in lower Sp scores and reduced rate of identification. In the example provided, none of the 9 predicted ion series get consecutive matches across the entire sequence (no sequence read-through observed), and at least 3 out of 9 series (a, b/x) are poorly represented with fragment ions.